
36 research outputs found

A Unified Coded Deep Neural Network Training Strategy Based on Generalized PolyDot Codes for Matrix Multiplication

Author: Bai Ziqian
Dutta Sanghamitra
Grover Pulkit
Jeong Haewon
Low Tze Meng
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 26/11/2018

This paper has two contributions. First, we propose a novel coded matrix multiplication technique called Generalized PolyDot codes that advances on existing methods for coded matrix multiplication under storage and communication constraints. This technique uses "garbage alignment," i.e., aligning computations in coded computing that are not a part of the desired output. Generalized PolyDot codes bridge between Polynomial codes and MatDot codes, trading off between recovery threshold and communication costs. Second, we demonstrate that Generalized PolyDot can be used for training large Deep Neural Networks (DNNs) on unreliable nodes prone to soft-errors. This requires us to address three additional challenges: (i) prohibitively large overhead of coding the weight matrices in each layer of the DNN at each iteration; (ii) nonlinear operations during training, which are incompatible with linear coding; and (iii) not assuming presence of an error-free master node, requiring us to architect a fully decentralized implementation without any "single point of failure." We allow all primary DNN training steps, namely, matrix multiplication, nonlinear activation, Hadamard product, and update steps as well as the encoding/decoding to be error-prone. We consider the case of mini-batch size $B=1$, as well as $B>1$, leveraging coded matrix-vector products, and matrix-matrix products respectively. The problem of DNN training under soft-errors also motivates an interesting, probabilistic error model under which a real number $(P,Q)$ MDS code is shown to correct $P-Q-1$ errors with probability $1$ as compared to $\lfloor \frac{P-Q}{2} \rfloor$ for the more conventional, adversarial error model. We also demonstrate that our proposed strategy can provide unbounded gains in error tolerance over a competing replication strategy and a preliminary MDS-code-based strategy for both these error models.

Comment: Presented in part at the IEEE International Symposium on Information Theory 2018 (Submission Date: Jan 12 2018); Currently under review at the IEEE Transactions on Information Theor
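The two error-correction counts quoted in the abstract above can be compared numerically. The following is a minimal sketch that only evaluates the two stated formulas for a $(P,Q)$ MDS code; the function names are hypothetical and the code does not implement the coding scheme itself.

```python
def adversarial_correctable(P: int, Q: int) -> int:
    """Errors corrected under the adversarial model: floor((P - Q) / 2)."""
    return (P - Q) // 2


def probabilistic_correctable(P: int, Q: int) -> int:
    """Errors corrected with probability 1 under the probabilistic model: P - Q - 1."""
    return P - Q - 1


# Illustrative (P, Q) pairs; these values are assumptions, not taken from the paper.
for P, Q in [(10, 6), (12, 4)]:
    print(P, Q, adversarial_correctable(P, Q), probabilistic_correctable(P, Q))
```

For any $P - Q \geq 3$ the probabilistic count is strictly larger than the adversarial one, which is the gap the abstract points to.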

arXiv.org e-Print Archive

Crossref

Editorial: Scalable Bioinformatics: Methods, Software Tools, and Hardware Architectures

Author: Alachiotis Nikolaos
Low Tze Meng
Pavlidis Pavlos
Publication date: 05/01/2022

PubMed Central

University of Twente Research Information

Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPU

Author: Blanco Mark
Kim Kyungjoo
Low Tze Meng
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 16/09/2020

In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained parallel approach to executing the support computation. This approach also increases available parallelism, making it amenable to GPU execution. We demonstrate our fine-grained parallel approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe between a 1.26-1.48x improvement on the CPU and a 9.97-16.92x improvement on the GPU due to our fine-grained parallel formulation.

Comment: 2019 IEEE High Performance Extreme Computing Conference (HPEC
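For readers unfamiliar with the support computation mentioned in the abstract above, here is a minimal sequential sketch of the textbook K-truss definition: the support of an edge is the number of triangles containing it, and edges with support below k - 2 are removed until none remain. This is not the Eager K-truss linear-algebraic formulation or the Kokkos implementation from the paper; the function name and the edge-list input format are assumptions.

```python
def k_truss(edges, k):
    """Return the sorted edges of the k-truss: the maximal subgraph in which
    every edge lies in at least k - 2 triangles."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                if u < v:
                    # Support of edge (u, v) = number of common neighbors.
                    support = len(adj[u] & adj[v])
                    if support < k - 2:
                        adj[u].discard(v)
                        adj[v].discard(u)
                        changed = True
    return sorted((u, v) for u in adj for v in adj[u] if u < v)


# A 4-clique on vertices 0..3 with one pendant edge (3, 4).
graph = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(k_truss(graph, 3))
```

The per-edge intersection sizes vary with vertex degree, which is one way to see where the load imbalance described in the abstract can come from.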

arXiv.org e-Print Archive

Crossref

Analytical Modeling is Enough for High Performance BLIS

Author: Igual Francisco D.
Low Tze Meng
Quintana-Orti Enrique S.
Smith Tyler M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016

We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of the matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. This allows the community to move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).

This research was sponsored in part by NSF grants ACI-1148125/1340293 and CCF-0917167. Enrique S. Quintana-Ortí was supported by project TIN2011-23283 of the Ministerio de Ciencia e Innovación and FEDER. Francisco D. Igual was supported by project TIN2012-32180 of the Ministerio de Ciencia e Innovación

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositori Institucional de la Universitat Jaume I

Towards an Objective Metric for the Performance of Exact Triangle Count

Author: Blanco Mark P.
Low Tze Meng
McMillan Scott
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 29/09/2021

The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline.

Comment: 6 Pages, 2020 IEEE High Performance Extreme Computing Conference (HPEC
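The abstract's point that edge count alone does not predict triangle-counting cost can be illustrated with a small sketch. It counts triangles exactly by intersecting higher-numbered neighbor sets and also tallies the membership checks performed, used here as a crude stand-in for work. The counting scheme, the work proxy, and the function name are assumptions for illustration; this is not the metric derived in the paper.

```python
def count_triangles(edges):
    """Return (exact triangle count, number of membership checks performed)."""
    adj = {}
    for u, v in edges:
        lo, hi = (u, v) if u < v else (v, u)
        # Store only higher-numbered neighbors so each triangle is counted once.
        adj.setdefault(lo, set()).add(hi)
        adj.setdefault(hi, set())
    triangles = 0
    checks = 0
    for u in adj:
        for v in adj[u]:
            for w in adj[u]:
                checks += 1
                if w in adj[v]:
                    triangles += 1
    return triangles, checks


# Three graphs with six edges each.
star = [(0, i) for i in range(1, 7)]
path = [(i, i + 1) for i in range(6)]
clique4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
for name, g in [("star", star), ("path", path), ("clique4", clique4)]:
    print(name, count_triangles(g))
```

All three inputs have the same number of edges, yet the number of checks differs by a factor of six between the star and the path, so an edges-per-second figure would rate identical code very differently on each.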

arXiv.org e-Print Archive

Crossref

Reformulating the direct convolution for high-performance deep learning inference on ARM processors

Author: Barrachina Mir Sergio
Castelló Adrián
Dolz Manuel F.
Low Tze Meng
Martinez Hector
Quintana-Orti Enrique S.
Tomás Domínguez Andrés Enrique
Upasana Sridhar
Publication venue: 'Elsevier BV'
Publication date: 20/12/2022

We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach based on the im2col transform plus the gemm kernel on an ARMv8-based processor. One of our methods presents the additional advantage of zero-memory overhead while the other employs an additional yet rather moderate workspace, substantially smaller than that required by the im2col+gemm solution. In contrast with a previous implementation of a similar zero-memory overhead direct convolution, this work exhibits the key advantage of preserving the conventional NHWC data layout for the input/output activations of the convolution layers.

Funding for open access charge: CRUE-Universitat Jaume
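The contrast between the direct algorithm and the im2col+gemm lowering described in the abstract above can be sketched in plain Python for a single image in HWC layout, with stride 1 and no padding. This is an unoptimized illustration under those assumptions, not the ARMv8 kernels from the paper; the function names and the nested-list tensor representation are hypothetical.

```python
def _shapes(x, w):
    H, W, C = len(x), len(x[0]), len(x[0][0])
    KH, KW, K = len(w), len(w[0]), len(w[0][0][0])
    return C, KH, KW, K, H - KH + 1, W - KW + 1


def direct_conv(x, w):
    """x: H x W x C input, w: KH x KW x C x K filters. No extra workspace."""
    C, KH, KW, K, HO, WO = _shapes(x, w)
    y = [[[0] * K for _ in range(WO)] for _ in range(HO)]
    for i in range(HO):
        for j in range(WO):
            for r in range(KH):
                for s in range(KW):
                    for c in range(C):
                        xv = x[i + r][j + s][c]
                        for k in range(K):
                            y[i][j][k] += xv * w[r][s][c][k]
    return y


def im2col_conv(x, w):
    """Same result via lowering; also returns the workspace element count."""
    C, KH, KW, K, HO, WO = _shapes(x, w)
    # One row per output pixel, one column per (filter tap, channel).
    cols = [[x[i + r][j + s][c]
             for r in range(KH) for s in range(KW) for c in range(C)]
            for i in range(HO) for j in range(WO)]
    wmat = [[w[r][s][c][k] for k in range(K)]
            for r in range(KH) for s in range(KW) for c in range(C)]
    # Plain matrix product standing in for the gemm kernel.
    out = [[sum(a * b for a, b in zip(row, col)) for col in zip(*wmat)]
           for row in cols]
    y = [[out[i * WO + j] for j in range(WO)] for i in range(HO)]
    return y, len(cols) * len(cols[0])


x = [[[(i * 4 + j + c) % 5 for c in range(2)] for j in range(4)] for i in range(4)]
w = [[[[(r + 2 * s + c + k) % 3 - 1 for k in range(3)] for c in range(2)]
      for s in range(3)] for r in range(3)]
y_lowered, workspace = im2col_conv(x, w)
print(direct_conv(x, w) == y_lowered, workspace)
```

In this toy case the lowered buffer holds 72 elements for a 32-element input, which shows the kind of workspace overhead that the direct algorithm avoids.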

Repositori Institucional de la Universitat Jaume I